This project required forecasting sales amounts for 7 different ids over the next 90 days, allowing the business to plan inventory and expected sales.
The data is provided as weekly time series. As such, investigation of, or features for, smaller time frames is not relevant.
Each time series is identified by a unique id: 1_1, 1_3, 1_8, 1_13,
1_38, 1_93, and 1_95. All time series begin on February 5, 2010 and end
on October 26, 2012. As such, each time series is 143 weeks long.
Time series data can be decomposed into its components of trend, seasonality, and noise.
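The decomposition code itself is not shown; a minimal sketch, assuming the timetk package (whose STL diagnostics also emit the frequency and trend messages below) and the `data_tbl` object with `Date` and `Weekly_Sales` columns used later in this report, might look like:

```r
library(dplyr)
library(timetk)

# STL-style decomposition diagnostics per series: the observed values,
# seasonal component, trend, and remainder are plotted for each id, and
# the inferred frequency and trend windows are printed as messages.
data_tbl %>%
  group_by(id) %>%
  plot_stl_diagnostics(
    .date_var = Date,
    .value    = Weekly_Sales
  )
```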
## frequency = 13 observations per 1 quarter
## trend = 52 observations per 1 year
The observed frequency is 13 observations per quarter, with a trend cycle of 52 observations per year. This is consistent with the earlier manual inspection of the data.
In order to apply the analysis to all of the time series simultaneously, the data is nested and grouped by id. A future length of 90 days has been included to allow for the required prediction, as per the business case above. The time series are also split into training and testing sets to allow for proper model validation.
nested_data_tbl <- data_tbl %>%
group_by(id) %>%
extend_timeseries( # adds in future timestamps, where the data will be missing
.id_var = id,
.date_var = Date,
.length_future = 90
) %>%
nest_timeseries( # adds nested dataframe, value and date as actual and future
.id_var = id,
.length_future = 90
) %>%
split_nested_timeseries( # creates train and test splits on actual data
.length_test = 90 # 90 days is the length of the test data set
)
The method adopted is to fit many different models with different parameters. This allows the best models for each individual time series to be selected. It also, at a later stage, allows ensembles of models to be created.
All told, 8 different algorithms are used. Some algorithms have multiple versions with different parameters, to improve the chance of finding a good model for each individual series; it is not possible to know in advance which algorithms (or which combinations) will suit which time series. The algorithms used are:
- ARIMA
- Boosted ARIMA
- ETS (Exponential Smoothing)
- Prophet
- Linear Models
- MARS (Multivariate Adaptive Regression Splines)
- XGBoost
- GLM (Generalised Linear Models)
Some algorithms are able to use additional variables / features to model the data. Where this is possible, these features have been created, and all features have been converted to a format that works for the individual algorithm.
An Auto ARIMA model is created
model_fit_arima <- arima_reg(seasonal_period = 52) %>%
set_engine("auto_arima") %>%
fit(Weekly_Sales ~ Date, extract_nested_train_split(nested_data_tbl))
#model_fit_arima
Three boosted ARIMA models are created, each with a different learning rate
model_fit_arima_boosted_1 <- arima_boost(
seasonal_period = 52,
min_n = 2,
learn_rate = 0.0015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(Weekly_Sales ~ Date + as.numeric(Date) + factor(months(Date)),
data = extract_nested_train_split(nested_data_tbl))
#model_fit_arima_boosted_1
# boosted arima 2 ----
model_fit_arima_boosted_2 <- arima_boost(
seasonal_period = 52,
min_n = 2,
learn_rate = 0.015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(Weekly_Sales ~ Date + as.numeric(Date) + factor(months(Date)),
data = extract_nested_train_split(nested_data_tbl))
#model_fit_arima_boosted_2
# boosted arima 3 ----
model_fit_arima_boosted_3 <- arima_boost(
seasonal_period = 52,
min_n = 2,
learn_rate = 0.150
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(Weekly_Sales ~ Date + as.numeric(Date) + factor(months(Date)),
data = extract_nested_train_split(nested_data_tbl))
#model_fit_arima_boosted_3
An ETS model is created
model_fit_ets <- exp_smoothing() %>%
set_engine(engine = "ets") %>%
fit(Weekly_Sales ~ Date, data = extract_nested_train_split(nested_data_tbl))
## frequency = 13 observations per 1 quarter
A Prophet model is created
model_fit_prophet <- prophet_reg(
seasonality_yearly = TRUE
) %>%
set_engine(engine = "prophet") %>%
fit(Weekly_Sales ~ Date, data = extract_nested_train_split(nested_data_tbl))
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
A Linear Model is created
model_fit_lm <- linear_reg() %>%
set_engine("lm") %>%
fit(Weekly_Sales ~ as.numeric(Date) + factor(months(Date), ordered = FALSE),
data = extract_nested_train_split(nested_data_tbl))
A Multivariate Adaptive Regression Spline model is created. For this model, additional date features are created which may help improve the performance of the model.
model_spec_mars <- mars(mode = "regression") %>%
set_engine("earth")
recipe_spec_mars <- recipe(Weekly_Sales ~ Date, data = extract_nested_train_split(nested_data_tbl)) %>%
step_date(Date, features = "month", ordinal = FALSE) %>%
step_mutate(date_num = as.numeric(Date)) %>%
step_normalize(date_num) %>%
step_rm(Date)
wflw_fit_mars <- workflow() %>%
add_recipe(recipe_spec_mars) %>%
add_model(model_spec_mars) %>%
fit(extract_nested_train_split(nested_data_tbl))
Four XGBoost models are created, each with a different learning rate, to improve the chance of finding a good model for each time series. Date features are added to these models, and the variables are transformed into a format that works for the algorithm.
rec_xgb <- recipe(Weekly_Sales ~ ., extract_nested_train_split(nested_data_tbl)) %>%
step_timeseries_signature(Date) %>% # creates calendar features
step_rm(Date) %>% # removes date feature for xgboost
step_zv(all_predictors()) %>% # removes zero variance predictors
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% # dummy any nominal predictors
step_integer(all_logical_predictors())
wflw_fit_xgb_1 <- workflow() %>%
add_model(boost_tree("regression", learn_rate = 0.50) %>% set_engine("xgboost")) %>%
add_recipe(rec_xgb)
wflw_fit_xgb_2 <- workflow() %>%
add_model(boost_tree("regression", learn_rate = 0.35) %>% set_engine("xgboost")) %>%
add_recipe(rec_xgb)
wflw_fit_xgb_3 <- workflow() %>%
add_model(boost_tree("regression", learn_rate = 0.15) %>% set_engine("xgboost")) %>%
add_recipe(rec_xgb)
wflw_fit_xgb_4 <- workflow() %>%
add_model(boost_tree("regression", learn_rate = 0.05) %>% set_engine("xgboost")) %>%
add_recipe(rec_xgb)
Three Generalised Linear Models are created. The penalty is adjusted for each model to improve the likelihood that a good model can be found for the different time series.
model_fit_glmnet_1 <- linear_reg(penalty = 0.01) %>%
set_engine("glmnet") %>%
fit(
Weekly_Sales ~ as.numeric(Date)
+ factor(months(Date), ordered = FALSE),
extract_nested_train_split(nested_data_tbl)
)
# GLM 2
model_fit_glmnet_2 <- linear_reg(penalty = 0.05) %>%
set_engine("glmnet") %>%
fit(
Weekly_Sales ~ as.numeric(Date)
+ factor(months(Date), ordered = FALSE),
extract_nested_train_split(nested_data_tbl)
)
# GLM 3
model_fit_glmnet_3 <- linear_reg(penalty = 0.1) %>%
set_engine("glmnet") %>%
fit(
Weekly_Sales ~ as.numeric(Date)
+ factor(months(Date), ordered = FALSE),
extract_nested_train_split(nested_data_tbl)
)
This section uses parallel processing so that modelling can be undertaken in parallel (at the same time). The machine used has 24 cores, and parallel processing utilises all of them.
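The core count is hard-coded below; as a portable alternative (a sketch, assuming base R's parallel package is available), the count can be detected at run time:

```r
# Detect the number of physical cores on the current machine rather
# than assuming 24, then start that many parallel workers.
n_cores <- parallel::detectCores(logical = FALSE)
parallel_start(n_cores)
```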
The models are then fitted to the training set here.
parallel_start(24) # for 24 clusters (i.e. cores)
nested_modeltime_tbl <- nested_data_tbl %>%
modeltime_nested_fit(
model_list = list(
model_fit_arima,
model_fit_arima_boosted_1,
model_fit_arima_boosted_2,
model_fit_arima_boosted_3,
model_fit_ets,
model_fit_prophet,
model_fit_lm,
wflw_fit_mars,
wflw_fit_xgb_1,
wflw_fit_xgb_2,
wflw_fit_xgb_3,
wflw_fit_xgb_4,
model_fit_glmnet_1,
model_fit_glmnet_2,
model_fit_glmnet_3
),
control = control_nested_fit(
verbose = FALSE,
allow_par = TRUE
)
)
nested_modeltime_tbl
## # Nested Modeltime Table
##
## Trained on: .splits | Model Errors: [0]
## # A tibble: 7 × 5
## id .actual_data .future_data .splits .modeltime_tables
## <fct> <list> <list> <list> <list>
## 1 1_1 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
## 2 1_3 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
## 3 1_8 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
## 4 1_13 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
## 5 1_38 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
## 6 1_93 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
## 7 1_95 <tibble [143 × 16]> <tibble [90 × 16]> <split [53|90]> <mdl_tm_t>
nested_modeltime_tbl %>%
extract_nested_test_accuracy() %>%
table_modeltime_accuracy()
The various error measurements can be used to select, for each time series id, which models might be useful in an ensemble of models.
By examining which models are most accurate for each time series, the best models are selected for each, and ensembles of these models are created. Three different kinds of ensembles are created: the mean of the component models, the median of the component models, and a weighted average of the component models, where the weights are chosen based on the relative accuracy of each model included.
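The model ids passed to the ensembles were chosen by inspecting the accuracy table manually; a sketch of how a per-series shortlist could be pulled programmatically (assuming dplyr, and the `rmse` column produced by `extract_nested_test_accuracy()`):

```r
library(dplyr)

# For each series, keep the five models with the lowest test RMSE;
# their .model_id values are candidate ensemble members.
nested_modeltime_tbl %>%
  extract_nested_test_accuracy() %>%
  group_by(id) %>%
  slice_min(rmse, n = 5) %>%
  select(id, .model_id, .model_desc, rmse)
```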
ensem <- nested_modeltime_tbl %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(4, 5, 8, 9, 10), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem %>% extract_nested_modeltime_table()
ensem_2 <- ensem %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(4, 5, 8, 9, 10), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_2 %>% extract_nested_modeltime_table()
ensem_3 <- ensem_2 %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(3, 4, 5, 6, 7, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_3 %>% extract_nested_modeltime_table()
ensem_4 <- ensem_3 %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(3, 4, 5, 6, 7, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_4 %>% extract_nested_modeltime_table()
ensem_5 <- ensem_4 %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_5 %>% extract_nested_modeltime_table()
ensem_6 <- ensem_5 %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_6 %>% extract_nested_modeltime_table()
ensem_7 <- ensem_6 %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(1, 2, 3, 4, 9, 10), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_7 %>% extract_nested_modeltime_table()
ensem_8 <- ensem_7 %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(1, 2, 3, 4, 9, 10), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
ensem_9 <- ensem_8 %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(1, 9, 10, 11), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_9 %>% extract_nested_modeltime_table()
ensem_10 <- ensem_9 %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(1, 9, 10, 11), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_10 %>% extract_nested_modeltime_table()
ensem_11 <- ensem_10 %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(2, 3, 6, 7, 9, 10, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_11 %>% extract_nested_modeltime_table()
ensem_12 <- ensem_11 %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(2, 3, 6, 7, 9, 10, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_12 %>% extract_nested_modeltime_table()
ensem_13 <- ensem_12 %>%
ensemble_nested_average(
type = "mean",
keep_submodels = TRUE,
model_ids = c(6, 7, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_13 %>% extract_nested_modeltime_table()
ensem_14 <- ensem_13 %>%
ensemble_nested_average(
type = "median",
keep_submodels = TRUE,
model_ids = c(6, 7, 13, 14, 15), # select the "best" models
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
#ensem_14 %>% extract_nested_modeltime_table()
model_ids <- c(4, 5, 8, 9, 10)
ensem_15 <- ensem_14 %>%
ensemble_nested_weighted(
loadings = c(4, 1, 3, 1, 1), # determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
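The loadings above were set by hand from the accuracy table; a sketch of deriving inverse-RMSE loadings instead (an assumption for illustration, not the method used in this project) could be:

```r
library(dplyr)

# Average test RMSE per submodel across all series, then weight each
# model by 1 / RMSE (lower error => larger loading). With
# scale_loadings = TRUE, whatever loadings are supplied get normalised.
rmse_by_model <- nested_modeltime_tbl %>%
  extract_nested_test_accuracy() %>%
  filter(.model_id %in% model_ids) %>%
  group_by(.model_id) %>%
  summarise(mean_rmse = mean(rmse, na.rm = TRUE))

inv_rmse_loadings <- 1 / rmse_by_model$mean_rmse
```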
model_ids <- c(3, 4, 5, 6, 7, 13, 14, 15)
ensem_16 <- ensem_15 %>%
ensemble_nested_weighted(
loadings = c(1, 2, 1, 1, 3, 3, 3, 3), # determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
model_ids <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15)
ensem_17 <- ensem_16 %>%
ensemble_nested_weighted(
loadings = c(2, 2, 2, 2, 2, 1, 3, 1, 1, 1, 3, 3, 3), # determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
model_ids <- c(1, 2, 3, 4, 9, 10)
ensem_18 <- ensem_17 %>%
ensemble_nested_weighted(
loadings = c(2, 2, 2, 2, 1, 1), # determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
model_ids <- c(1, 9, 10, 11)
ensem_19 <- ensem_18 %>%
ensemble_nested_weighted(
loadings = c(1, 3, 3, 2), # determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
model_ids <- c(2, 3, 6, 7, 9, 10, 13, 14, 15)
ensem_20 <- ensem_19 %>%
ensemble_nested_weighted(
loadings = c(1, 1, 2, 1, 1, 3, 1, 1, 1), # determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
model_ids <- c(4, 5, 6, 7, 13, 14, 15)
ensem_21 <- ensem_20 %>%
ensemble_nested_weighted(
loadings = c(1, 2, 1, 1, 1, 1, 1), # one loading per model id, determined from the mae of the chosen models
scale_loadings = TRUE,
metric = "rmse",
keep_submodels = TRUE,
model_ids = model_ids,
control = control_nested_fit(allow_par = TRUE, verbose = FALSE)
)
The table below includes all of the models and ensembles used. The accuracy is shown for each time series, for each model.
ensem_21 %>%
extract_nested_test_accuracy() %>%
table_modeltime_accuracy()
RMSE (root mean squared error) was chosen as the metric to select the best model for each time series.
nested_best_tbl <- ensem_21 %>%
modeltime_nested_select_best(metric = "rmse")
The plot below shows the best performing model for each time series.
nested_best_tbl %>%
extract_nested_test_forecast() %>%
group_by(id) %>%
plot_modeltime_forecast(.facet_ncol = 3)
The best model for each time series is then refitted to the whole dataset, to give a good prediction for the next 90 days.
nested_best_refit_tbl <- nested_best_tbl %>%
modeltime_nested_refit(
control = control_refit(
verbose = FALSE,
allow_par = TRUE
)
)
# Save refitted model
# nested_best_refit_tbl %>% write_rds("nested_best_refit_tbl.rds")
This is done for each time series; the greyed areas show the margin for error.
nested_best_refit_tbl %>%
extract_nested_future_forecast() %>%
group_by(id) %>%
plot_modeltime_forecast(.facet_ncol = 3)
Finally, parallel processing is stopped.
parallel_stop()